Rijkstrainees - Statistical Programming in R

Today

Course Materials

This lecture

  • Pipes

  • Data manipulation

  • Basic analysis (correlation & t-test)

  • Data visualization with ggplot2

Two learning curves: Practicals with and without answers

Packages we use in these slides

library(dplyr)      # data manipulation
library(magrittr)   # pipes
library(mice)       # for the boys data
library(ggplot2)    # visualization
library(DT)         # fancy JS/html tables
library(reshape2)   # melt stuff

Opinion

The data

head(boys)
##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south

Goal

At the end of this lecture we aim to understand what happens in

ggplot(mutate(na.omit(select(boys, age, hgt, reg)), height_meters = hgt/100), 
       aes(height_meters, age)) + geom_point(aes(group = reg))

Pipes

This is a pipe:

boys %>% 
  select(is.numeric) %>% 
  cor(use = "pairwise.complete.obs") %>% 
  round(3)

It effectively replaces round(cor(select(boys, is.numeric), use = "pairwise.complete.obs"), digits = 3).

Why are pipes useful?

Benefit: a single object in memory that is easy to interpret Your code becomes more readable:

  • data operations are structured from left-to-right and not from in-to-out
  • nested function calls are avoided
  • local variables and copied objects are avoided
  • easy to add steps in the sequence

What do pipes do:

  • f(x) becomes x %>% f()
boys$age %>% mean()
## [1] 9.158866
  • f(x, y) becomes x %>% f(y)
boys %>% head(n = 1)
##     age  hgt  wgt   bmi   hc  gen  phb tv   reg
## 3 0.035 50.1 3.65 14.54 33.7 <NA> <NA> NA south
  • h(g(f(x))) becomes x %>% f %>% g %>% h
boys %>% select(is.numeric) %>% na.omit() %>% colMeans
##       age       hgt       wgt       bmi        hc        tv 
##  14.07467 165.42411  54.01027  19.08348  55.28884  11.89732

More pipe stuff

The standard %>% pipe

The %$% pipe

The role of . in a pipe

In a %>% b(arg1, arg2, arg3), a will become arg1. With . we can change this.

boys %>%
  plot(bmi ~ age, data = .)

VS

boys %$%
  plot(bmi ~ age)

The . can be used as a placeholder in the pipe.

Data manipulation

Performing a t-test in a pipe

boys %>%
  mutate(ovwgt = bmi > 25) %$% 
  t.test(age ~ ovwgt)
## 
##  Welch Two Sample t-test
## 
## data:  age by ovwgt
## t = -15.971, df = 32.993, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
##  -9.393179 -7.270438
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            9.103392           17.435200

is the same as

t.test(age ~ (bmi > 25), data = boys)

Melting

boys %>% 
  select(reg, age) %>% 
  melt(id.vars = "reg", variable.name = "variable", value.name = "value") %>%
  datatable(options = list(pageLength = 25, scrollY = "250px"))

Calculate statistics

boys %>% 
  select(reg, age) %>% 
  melt(id.vars = "reg", variable.name = "variable", value.name = "value") %>%
  group_by(variable, reg) %>% 
  summarise_all(list(mean = mean, sd = sd), na.rm = TRUE) %>% 
  datatable(options = list(pageLength = 25, scrollY = "200px"))

Multiple columns

boys %>% 
  select(reg, where(is.numeric)) %>% 
  melt(id.vars = "reg", variable.name = "variable", value.name = "value") %>%
  group_by(variable, reg) %>% 
  summarise_all(list(mean = mean, sd = sd), na.rm = TRUE) %>% 
  datatable(options = list(pageLength = 25, scrollY = "200px"))

Mutate: add

boys %>% 
  mutate(bmi_calc = wgt / (hgt/100)^2) %>% 
  select(bmi, bmi_calc) %>% 
  head()
##      bmi bmi_calc
## 3  14.54 14.54177
## 4  11.77 11.77395
## 18 12.56 12.56000
## 23 14.37 14.37589
## 28 15.21 15.21361
## 36 15.11 15.11241

Mutate: remove

boys %>% 
  mutate(reg = NULL,
         gen = NULL, 
         phb = NULL) %>% 
  head()
##      age  hgt   wgt   bmi   hc tv
## 3  0.035 50.1 3.650 14.54 33.7 NA
## 4  0.038 53.5 3.370 11.77 35.0 NA
## 18 0.057 50.0 3.140 12.56 35.2 NA
## 23 0.060 54.5 4.270 14.37 36.7 NA
## 28 0.062 57.5 5.030 15.21 37.3 NA
## 36 0.068 55.5 4.655 15.11 37.0 NA

Mutate: change

boys %>% 
  mutate(hgt = hgt/100) %>% 
  tail()
##         age   hgt  wgt   bmi   hc  gen  phb tv   reg
## 7410 20.372 1.887 59.8 16.79 55.2 <NA> <NA> NA  west
## 7418 20.429 1.811 67.2 20.48 56.6 <NA> <NA> NA north
## 7444 20.761 1.891 88.0 24.60   NA <NA> <NA> NA  west
## 7447 20.780 1.935 75.4 20.13   NA <NA> <NA> NA  west
## 7451 20.813 1.890 78.0 21.83 59.9 <NA> <NA> NA north
## 7475 21.177 1.818 76.5 23.14   NA <NA> <NA> NA  east

Mutate: transform column

boys %$% table(reg)
## reg
## north  east  west south  city 
##    81   161   239   191    73
boys %>% 
  select(hgt, reg) %>%
  mutate(across(!hgt, as.numeric)) %$% 
  table(reg)
## reg
##   1   2   3   4   5 
##  81 161 239 191  73

Data visualization with ggplot2

The anscombe data

anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

The same statistical properties

anscombe %>% colMeans()
##       x1       x2       x3       x4       y1       y2       y3       y4 
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
anscombe %>% cor() %>% round(digits = 3) %>% .[1:4, 5:8]
##        y1     y2     y3     y4
## x1  0.816  0.816  0.816 -0.314
## x2  0.816  0.816  0.816 -0.314
## x3  0.816  0.816  0.816 -0.314
## x4 -0.529 -0.718 -0.345  0.817
anscombe %>% var() %>% round(digits = 3) %>% .[1:4, 5:8]
##        y1     y2     y3     y4
## x1  5.501  5.500  5.497 -2.115
## x2  5.501  5.500  5.497 -2.115
## x3  5.501  5.500  5.497 -2.115
## x4 -3.565 -4.841 -2.321  5.499

Fitting a line

anscombe %>%
  ggplot(aes(y1, x1)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Why visualise?

  • We can process a lot of information quickly with our eyes
  • Plots give us information about
    • Distribution / shape
    • Irregularities
    • Assumptions
    • Intuitions
  • Summary statistics, correlations, parameters, model tests, p-values do not tell the whole story

ALWAYS plot your data!

Why visualise?

Source: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.

Why visualise?

What is ggplot2?

Layered plotting based on the book The Grammer of Graphics by Leland Wilkinsons.

With ggplot2 you

  1. provide the data
  2. define how to map variables to aesthetics
  3. state which geometric object to display
  4. (optional) edit the overall theme of the plot

ggplot2 then takes care of the details

An example: scatterplot

1: Provide the data

mice::boys %>%
  ggplot()

2: map variable to aesthetics

mice::boys %>%
  ggplot(aes(x = age, y = bmi))

3: state which geometric object to display

mice::boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point()

An example: scatterplot

Why this syntax?

Create the plot

gg <- 
  mice::boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point(col = "dark green")

Add another layer (smooth fit line)

gg <- gg + 
  geom_smooth(col = "dark blue")

Give it some labels and a nice look

gg <- gg + 
  labs(x = "Age", y = "BMI", title = "BMI trend for boys") +
  theme_minimal()

Why this syntax?

plot(gg)

Why this syntax?

Revisit the start

ggplot(mutate(na.omit(select(boys, age, hgt, reg)), height_meters = hgt/100), 
       aes(height_meters, age)) + geom_point(aes(group = reg))

Is the same as

boys %>% 
  select(age, hgt, reg) %>% # select features
  na.omit() %>% # remove missings. NAUGHTY!
  mutate(height_meters = hgt/100) %>% # transform height
  ggplot(aes(x = height_meters, y = age)) + # define plot aes
  geom_point(aes(group = reg)) # add geom